Automatic Extraction of Word Sequence Correspondences in Parallel Corpora
Abstract
This paper proposes a method of finding correspondences of arbitrary length word sequences in aligned parallel corpora of Japanese and English. Translation candidates of word sequences are evaluated by a similarity measure between the sequences, defined by the co-occurrence frequency and the independent frequency of the word sequences. The similarity measure is an extension of the Dice coefficient. An iterative method with gradual threshold lowering is proposed to obtain a high-quality translation dictionary. The method is tested with parallel corpora of three distinct domains and achieves over 80% accuracy.

1 Introduction

A high-quality translation dictionary is indispensable for machine translation systems with good performance, especially for domains of expertise. Because such dictionaries are effectively usable only in their own domains, much human labour will be saved if such a dictionary is obtained automatically from a set of translation examples.

This paper proposes a method to construct a translation dictionary that consists not only of word pairs but of pairs of arbitrary length word sequences of the two languages. All of the pairs are extracted from a parallel corpus of a specific domain. The method is evaluated with Japanese-English parallel corpora of three distinct domains.

Several attempts have been made for similar purposes, but with different settings (see [Kupiec 93], [Kumano & Hirakawa 94], [Smadja 96]). Kupiec and Kumano & Hirakawa propose methods of obtaining translation patterns of noun compounds from bilingual corpora. Kumano & Hirakawa differ from the other works in that they assume an ordinary bilingual dictionary and use non-parallel (non-aligned) corpora. Their target is to find correspondences not only at the word level but also of noun phrases and unknown words. However, the target noun phrases and unknown words are decided in the preprocessing stage.

Brown et al. use a probabilistic measure for estimating word similarity between two languages in their statistical approach to language translation [Brown 88]. In their work on aligning parallel texts, Kay & Röscheisen used the Dice coefficient as the word similarity for ensuring sentence-level correspondence [Kay & Röscheisen 93]. Kitamura & Matsumoto use the same measure to calculate word similarity in their work on the extraction of translation patterns. The similarity measure is used as the basis of their structural matching of parallel sentences so as to extract structural translation patterns.

In specialised texts, correspondences between word sequences, rather than between single words, are abundant, especially in the form of noun compounds or fixed phrases, and they are key to better performance. Though the method proposed in this paper deals only with consecutive sequences of words and is intended to provide a better base for the structural matching that follows, the results themselves show very useful and informative translation patterns for the domain.

Our method extends the usage of the Dice coefficient in two ways: it deals not only with correspondences between words but also with correspondences between word sequences, and it modifies the formula so that more plausible corresponding pairs are identified earlier.
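To make the idea concrete, here is a minimal Python sketch of scoring word-sequence pairs and accepting them over passes with a gradually lowered threshold. It rests on several assumptions. The text above does not give the exact form of the extended measure, so the sketch uses the plain Dice coefficient, 2 * f_je / (f_j + f_e), where f_je is the number of aligned sentence pairs in which both sequences occur and f_j, f_e are their independent frequencies. The names `ngrams`, `extract`, `thresholds`, `min_count` and `max_len`, and the rule of removing accepted sequences before the next pass, are hypothetical choices for illustration and not the authors' procedure.

```python
from collections import Counter
from itertools import product


def ngrams(tokens, max_len):
    """Return the set of contiguous word sequences of length 1..max_len."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    }


def extract(pairs, thresholds=(1.0, 0.8, 0.6), max_len=2, min_count=2):
    """Accept sequence pairs in passes, lowering the Dice threshold each pass.

    pairs: list of (source_tokens, target_tokens) aligned sentence pairs.
    Returns a dict mapping (source_sequence, target_sequence) to its score.
    """
    # Candidate sequences still available in each aligned sentence pair.
    cands = [(ngrams(s, max_len), ngrams(t, max_len)) for s, t in pairs]
    accepted = {}
    for threshold in thresholds:
        src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
        for s_grams, t_grams in cands:
            src_freq.update(s_grams)                   # independent frequency
            tgt_freq.update(t_grams)
            co_freq.update(product(s_grams, t_grams))  # co-occurrence frequency
        new = {}
        for (s, t), c in co_freq.items():
            score = 2 * c / (src_freq[s] + tgt_freq[t])  # plain Dice coefficient
            if c >= min_count and score >= threshold:
                new[(s, t)] = score
        accepted.update(new)
        # Assumption: sequences explained by an accepted pair are removed from
        # the sentence pairs where both occur, so later passes see fewer rivals.
        for s_grams, t_grams in cands:
            hits = [(s, t) for s, t in new if s in s_grams and t in t_grams]
            s_grams.difference_update(s for s, _ in hits)
            t_grams.difference_update(t for _, t in hits)
    return accepted
```

On a very small corpus several candidates can tie at the same score (for example a two-word compound and each of its parts), which is one reason a modified measure that ranks more plausible pairs earlier is attractive.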
Similar resources
Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining
We present an unsupervised extraction of sequence-to-sequence correspondences from parallel corpora by sequential pattern mining. The main characteristics of our method are two-fold. First, we propose a systematic way to enumerate all possible translation pair candidates of rigid and gapped sequences without falling into combinatorial explosion. Second, our method uses an efficient data structu...
Extracting Word Sequence Correspondences with Support Vector Machines
This paper proposes a method of learning and extracting word sequence correspondences from non-aligned parallel corpora with Support Vector Machines, which generalize well, rarely overfit the training samples, and can learn dependencies among features by using a kernel function. Our method uses features for the translation model which use the translation dictionary, t...
Computational Lexicography and Lexicology Elexbi, a Basic Tool for Bilingual Term Extraction from Spanish-Basque Parallel Corpora
We present the work done by Elhuyar Foundation in the field of bilingual terminology extraction. The aim of this work is to develop some techniques for the automatic extraction of pairs of equivalent terms from Spanish-Basque translation memories, and to implement those techniques in a prototype. Our approach is based on a monolingual extraction of term candidates in each language, then the creati...
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
Publication date: 2002